#Necessary Packages
library(tidyverse)
library(RColorBrewer)
library(mosaic)
library(mdsr)
library(RColorBrewer)
library(sp)
library(sf)DSC365: Introduction to Data Science
What is Spatial Data?
Spatial data relates to a geographical area or location:
- Tobler’s First Law of Geography: things close in space are more related than things further in space.
- Going to talk about how to manage and visualize spatial data in R. Modeling goes beyond the scope of this class.
Common Spatial Data Structure: Shapefiles
- Spatial data consists of geometric items like points, lines, and polygons; not rows and columns
- Contain “instructions” on drawing boundaries for countries, etc.
Package focus: sf
tidyversefriendly package that contains commonly used functions for spatial objects
Visualizing Spatial Data
Example: You know nothing, John Snow…
In the 1850s, there was a cholera outbreak in Victorian London. A physician, Dr. John Snow, took a unique approach to studying the outbreak: he mapped where the patients lived.
- If you want to learn more: https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
head(CholeraDeaths)Simple feature collection with 6 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 529308.7 ymin: 181006 xmax: 529336.7 ymax: 181031.4
Projected CRS: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
Id Count geometry
1 0 3 POINT (529308.7 181031.4)
2 0 2 POINT (529312.2 181025.2)
3 0 1 POINT (529314.4 181020.3)
4 0 1 POINT (529317.4 181014.3)
5 0 4 POINT (529320.7 181007.9)
6 0 2 POINT (529336.7 181006)
coords <- do.call(rbind, st_geometry(CholeraDeaths)) %>% as.data.frame() %>% setNames(c("X","Y"))
coords$count = CholeraDeaths$Count
coords%>% ggplot(aes(x=X, y=Y)) + geom_point(aes(col = count))Problem… this is not a map. It is hard to see where those points are in 1850s London.
Let’s add some context using the coordinates (units in degrees), but still lacking context
ggplot(CholeraDeaths) +
geom_sf()Add London Street Map:
library(ggspatial)
ggplot(CholeraDeaths) +
annotation_map_tile(type = "osm", zoomin = 0) +
geom_sf(aes(size = Count), alpha = 0.7)
|
| | 0%
|
|==== | 6%
|
|========= | 12%
|
|============= | 19%
|
|================== | 25%
|
|====================== | 31%
|
|========================== | 38%
|
|=============================== | 44%
|
|=================================== | 50%
|
|======================================= | 56%
|
|============================================ | 62%
|
|================================================ | 69%
|
|==================================================== | 75%
|
|========================================================= | 81%
|
|============================================================= | 88%
|
|================================================================== | 94%
|
|======================================================================| 100%
Map Projections:
A note of map projections:
- In this map both
CholeraDeathsand the map tiles retrieved by theggspatialpackage have geospatial coordinates, but those coordinates are not in the same unit - Earth happens to be an oblate spheroid—a three-dimensional flattened sphere. Yet we would like to create two-dimensional representations of the Earth that fit on pages or computer screens
- Converting from a 3d coordinate system to a 2d is called a projection
- How you project your data influences how your map is perceived.
- Would need to update the
coordinate reference systemofChloreaDeathsfo fix this problem
library(mapproj)
library(maps)
map("world", projection = "mercator", wrap = TRUE)map("world", projection = "cylequalarea", param = 45, wrap = TRUE)Let’s try to fix our map projections
sf::st_crs(CholeraDeaths)Coordinate Reference System:
User input: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
wkt:
PROJCRS["unknown",
BASEGEOGCRS["unknown",
DATUM["Unknown based on Airy 1830 ellipsoid",
ELLIPSOID["Airy 1830",6377563.396,299.3249646,
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8901]]],
CONVERSION["unknown",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",49,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-2,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.9996012717,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",400000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",-100000,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1,
ID["EPSG",9001]]]]
cholera_latlong <- CholeraDeaths %>%
st_set_crs(27700) %>%
st_transform(4326)
ggplot(cholera_latlong) +
annotation_map_tile(type = "osm", zoomin = 0) +
geom_sf(aes(size = Count))
|
| | 0%
|
|================== | 25%
|
|=================================== | 50%
|
|==================================================== | 75%
|
|======================================================================| 100%
Other methods to visualizing Spatial Data
Google API key
#devtools::install_github("dkahle/ggmap")
library(ggmap)
m <- get_map('John Snow, London, England', maptype='roadmap')
ggmap(m)In mid-2018, Google Maps began requiring users to have a registered API key.
- Users have to set up an account with Google, enable the relevant APIs, and then tell R about the user’s set up.
- Part of this process requires a credit card. It’s totally lame. Don’t do it.
Instructions to use Google API key (if you insist):
- To obtain an API key and enable services, go to https://cloud.google.com/maps-platform/.
- To tell ggmap about your API key, use register_google(), e.g. register_google(key = “mQkzTpiaLYjPqXQBotesgif3EfGL2dbrNVOrogg”) (that’s a fake key). This will set your API key for the current session, but if you restart R, you’ll need to do it again.
Free Alternative: tmap
library(tmap)
map <- qtm(CholeraDeaths, symbols.col = "Count", symbols.id="Count")
tmap_leaflet(map)Another Free Alternative: leaflet
Leaflet is an open-source JavaScript library for interactive maps. This R package makes it easy to create Leaflet maps from R.
- Website for more information: https://rstudio.github.io/leaflet/
library(leaflet)
library(readr)
bigfoot_data <- read_csv("https://query.data.world/s/egnaxxvegdkzzrhfhdh4izb6etmlms")
bigfoot_data %>%
filter(classification == "Class A") %>%
mutate(seasoncolor = str_replace_all(season, c("Fall" = "orange",
"Winter" = "skyblue",
"Spring" = "green",
"Summer" = "yellow")),
# This code just wraps the description to the width of the R terminal
# and inserts HTML for a line break into the text at appropriate points
desc_wrap = purrr::map(observed, ~strwrap(.) %>%
paste(collapse = "<br/>") %>%
htmltools::HTML())) %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(~longitude, ~latitude, color = ~seasoncolor, label = ~desc_wrap)Another Example: Education levels in the US
The Education.csv data set contains information on educational attainment for each county in the United States.
First: We need Nebraska shapefiles to create the boundaries. We are going to use the tigris package, which includes shapefiles from the US Census Bureau
library(tigris)
US_counties <- counties(cb = TRUE)
US_counties <- US_counties %>% arrange(GEOID)
glimpse(US_counties)Education <- read.csv("Education.csv")US_counties$GEOID = as.numeric(US_counties$GEOID)
Education$FIPS.Code = as.numeric(Education$FIPS.Code)
Education_Counties <- US_counties %>% inner_join(Education, by=c('GEOID'='FIPS.Code'))
glimpse(Education_Counties)Rows: 3,211
Columns: 27
$ STATEFP <chr> "01", "01", "01", "01", "01", "01", "01", "01", "…
$ COUNTYFP <chr> "001", "003", "005", "007", "009", "011", "013", …
$ COUNTYNS <chr> "00161526", "00161527", "00161528", "00161529", "…
$ AFFGEOID <chr> "0500000US01001", "0500000US01003", "0500000US010…
$ GEOID <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1…
$ NAME <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount"…
$ NAMELSAD <chr> "Autauga County", "Baldwin County", "Barbour Coun…
$ STUSPS <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ STATE_NAME <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ LSAD <chr> "06", "06", "06", "06", "06", "06", "06", "06", "…
$ ALAND <dbl> 1539631461, 4117724893, 2292160151, 1612188713, 1…
$ AWATER <dbl> 25677536, 1132887353, 50523213, 9572302, 14860281…
$ State <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ Area.name <chr> "Autauga County", "Baldwin County", "Barbour Coun…
$ X2003Rural.Urban <int> 2, 4, 6, 1, 1, 6, 6, 3, 6, 8, 1, 9, 7, 9, 8, 6, 3…
$ X2003Urban.Influence <int> 2, 5, 6, 1, 1, 6, 6, 2, 5, 6, 1, 10, 11, 10, 4, 5…
$ X2013Rural.Urban <int> 2, 3, 6, 1, 1, 6, 6, 3, 6, 6, 1, 9, 7, 9, 8, 4, 3…
$ X2013Urban.Influence <int> 2, 2, 6, 1, 1, 6, 6, 2, 5, 6, 1, 10, 11, 10, 4, 5…
$ LessHS2000 <dbl> 21.3, 18.0, 35.3, 36.8, 29.6, 39.5, 32.2, 26.1, 3…
$ HSonly2000 <dbl> 33.8, 29.6, 32.4, 35.7, 36.0, 35.2, 34.5, 32.2, 3…
$ SomeCollege2000 <dbl> 26.9, 29.3, 21.3, 20.4, 24.8, 17.5, 22.9, 26.4, 2…
$ CollegePlus2000 <dbl> 18.0, 23.1, 10.9, 7.1, 9.6, 7.7, 10.4, 15.2, 9.5,…
$ LessHS2013 <dbl> 12.3, 9.8, 26.9, 17.9, 20.2, 28.6, 18.9, 16.8, 19…
$ HSonly2013 <dbl> 33.6, 27.8, 35.5, 43.9, 32.3, 36.6, 40.4, 32.2, 3…
$ SomeCollege2013 <dbl> 29.1, 31.7, 25.5, 25.0, 34.4, 21.4, 24.5, 33.1, 2…
$ CollegePlus2013 <dbl> 25.0, 30.7, 12.0, 13.2, 13.1, 13.4, 16.1, 17.9, 1…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-86.9212 32..., MULT…
Plot education levels by county in Nebraska: less than a high school degree.
NE_Education <- Education_Counties %>% filter(State=='NE')
favstats(~LessHS2013, data=NE_Education) min Q1 median Q3 max mean sd n missing
1.1 6.1 7.8 9.9 28.8 8.548387 4.329188 93 0
ggplot2
ggplot(NE_Education) + geom_sf(aes(fill = LessHS2013)) + scale_fill_distiller(palette='Blues', direction=+1)ggplot(NE_Education) + geom_sf(aes(fill = LessHS2013)) + scale_fill_distiller(palette='Blues', direction=+1) + geom_sf_label(aes(label = NAME), fill = "white", size = 2)leaflet
greens <- colorBin("Greens", domain = c(0, 30), bins = 6)
leaflet(NE_Education) %>%
addTiles() %>%
addPolygons(color = "black",
opacity = 1,
weight = 1,
fillColor = ~greens(LessHS2013),
fillOpacity = 0.7,
label = ~paste0("County: ", Area.name, ' (', LessHS2013, '%)')) %>%
addLegend(position = 'bottomleft',
colors = greens(seq(from=2.5, to=27.5, length=6)),
labels = c('0-5%', '5.1-10%', '10.1-15%', '15.1-20%', '20.1-25%', '25.1-30%'),
title = 'Less than a HS Degree (2013)')Try Yourself!
One variable we might be interested in is the margin: difference between percentage of those graduating with a High School only degree between 2013 and 2000. Try to make a new variable called Margin which represents the difference between those graduating from High School only, and then make a plot.
NE_Education <- NE_Education %>% mutate(MarginDiff = HSonly2013-HSonly2000)
ggplot(NE_Education) + geom_sf(aes(fill = MarginDiff)) + scale_fill_distiller(palette='RdBu', direction=-1)palette <- colorBin("RdBu", domain = c(-70, 70), bins = 9, reverse = TRUE)
leaflet(NE_Education) %>%
addTiles() %>%
addPolygons(color = "black",
opacity = 1,
weight = 1,
fillColor = ~palette(MarginDiff),
fillOpacity = 0.7,
label = ~paste0("Name: ", NAME))Another Example: Election Results
Daily Kos compiled results by congressional district for the 2008, 2012, and 2016 presidential elections, this data is saved as CongressionalElections.csv. Let’s use this data to visualize election results in the US.
FiveThirtyEightidentifies the states of Colorado, Florida, Iowa, Michigan, Minnesota, Ohio, Nevada, New Hampshire, North Carolina, Pennsylvania, Virginia, and Wisconsin as “perennial” swing states that have regularly seen close contests over the last few presidential campaigns.
CongressionalElections <- read.csv("CongressionalElections.csv")
glimpse(CongressionalElections)Rows: 435
Columns: 9
$ CD <chr> "AK-AL", "AL-01", "AL-02", "AL-03", "AL-04", "AL-05", "AL-…
$ Incumbent <chr> "Young, Don", "Byrne, Bradley", "Roby, Martha", "Rogers, M…
$ Party <chr> "(R)", "(R)", "(R)", "(R)", "(R)", "(R)", "(R)", "(D)", "(…
$ Clinton2016 <dbl> 37.6, 34.1, 33.0, 32.3, 17.4, 31.3, 26.1, 69.8, 30.2, 41.7…
$ Trump2016 <dbl> 52.8, 63.5, 64.9, 65.3, 80.4, 64.7, 70.8, 28.6, 65.0, 52.4…
$ Obama2012 <dbl> 41.2, 37.4, 36.4, 36.8, 24.0, 34.9, 24.7, 72.4, 36.3, 42.9…
$ Romney2012 <dbl> 55.3, 61.8, 62.9, 62.3, 74.8, 63.9, 74.3, 27.1, 61.0, 54.7…
$ Obama2008 <dbl> 38.1, 38.5, 35.0, 36.6, 25.5, 36.3, 25.0, 71.5, 39.2, 44.3…
$ McCain2008 <dbl> 59.7, 60.9, 64.5, 62.6, 73.3, 62.6, 74.1, 28.1, 58.0, 53.8…
Let’s look at the swing state of North Carolina
NC_congressional <- congressional_districts(cb = TRUE, state = "North Carolina", year = 2016) NC_congressional <- NC_congressional %>% arrange(CD115FP)
NC_data <- CongressionalElections %>% filter(str_detect(CD, 'NC'))
NC_congressional$Clinton2016 <- NC_data$Clinton2016
NC_congressional$Trump2016 <- NC_data$Trump2016
NC_congressional <- NC_congressional %>% mutate(Margin2016 = Trump2016-Clinton2016)
ggplot(NC_congressional) + geom_sf(aes(fill = Margin2016)) + scale_fill_distiller(palette='RdBu', direction=-1)leaflet(NC_congressional) %>%
addTiles() %>%
addPolygons(color = "black",
opacity = 1,
weight = 1,
fillColor = ~palette(Margin2016),
fillOpacity = 0.7,
label = ~paste0("District: ", CD115FP))